4  Chapter 4: Future directions

The missing data in mIF image can be seen as missing completely at random (MCAR). In the missing data scenarios described in Chapter 3, none of them can be predicted. However, for the low-plex image channel imputation methods, the markers are selected by machine learning methods to maximize the accuracy. For this type of imputation, the imputation results should be taken with a grain of salt as this is missing not at random (MNAR). It might be possible to deduce the amount of bias this creates through statistical inference, but for now this is not in the scope of this project.

All methods in Chapter 3 evaluated the accuracy of their result by either comparing with the imputation outcome of a previous method or with a held-out evaluation dataset. In addition, Wu et al. (2023) in Section 3.2.1 used the imputation output for cell phenotyping and predicting patient phenotypic outcomes. In principle, this is not the best practice of subsequent analysis with the imputation outcome. Such prediction should be performed with multiple imputation at least, to obtain a reasonable estimation of confidence intervals.

Rubin (1996) states that it is important to be statistically valid for estimates of scientific estimands, such as population mean. Statistical validity for an estimand means an at least approximately non-biased point estimate, and an statistical test that rejects the null hypothesis no less than 5% of the time, when the nominal significance level is 5% (Rubin 1996; Van Buuren 2018). Under such guideline, we can check the multiple imputation estimand’s expectation and variance. Van Buuren (2018) presented a clear interpretation of such, conditioning on observed data: Let \(Y_{mis}\) be missing data, and \(Y_{obs}\) be observed data, \(Q\) be the population value and \(\hat Q\) is the estimate of \(Q\). The posterior distribution of \(Q\) given observed data is \[ (Q|Y_{obs})=\int P(Q|Y_{obs}, Y_{mis})P(Y_{mis}|Y_{obs})dY_{mis} \]

The posterior mean of \(Q\) is therefore \[ E[Q|Y_{obs}]=E[E(Q|Y_{obs}, Y_{mis})|Y_{obs}] \]

Which can be interpreted as the average of estimates over repeatedly imputed data, given the observed data.

The posterior variance is therefore

\[ Var(Q|Y_{obs})=E[Var(Q|Y_{obs}, Y_{mis})|Y_{obs}]+Var[E(Q|Y_{obs}, Y_{mis})|Y_{obs}] \] Which can be interpreted as the within-variance of the estimate in each imputed data, and the between-variance among the repeatedly imputed data.

In this case, single imputation would be an unbiased estimator. However, it will be biased in variance estimation, as it does not incorporate between-variance at all. Therefore, for better accuracy with estimate variance, multiple imputation should be used. The subsequent project could use the colon map data in GammaGateR project, where some channels are missing, and compare the validity of confidence interval for the estimated quantities.